{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "# Generate Credit Card Transactions\n",
    "**This notebook generates credit card transactions and randomly injects fraud chain attacks.**\n",
    "\n",
    "---\n",
    "\n",
    "## Contents\n",
    "\n",
    "1. [Background](#Background)\n",
    "1. [Setup](#Setup)\n",
    "1. [Generate Transactions](#Generate-Transactions)\n",
    "1. [Inject Fradulent Transactions](#Inject-Fradulent-Transactions)\n",
    "1. [Save Generated Data](#Save-Generated-Data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "### Background\n",
    "<br>\n",
    "<div style=\"text-align: justify\">This notebook generates random credit card transactions for 10K users over a period of 5 months. In an ideal scenario, these historical transactions would be accumulated into a data lake/store for batch processing so as to derive insights and analytics about this data. Credit card numbers can be bought in bulk on the dark web through previous leaks or hacks of organizations that store this sensitive data. Fraudsters will buy these card lists and attempt to make as many transactions as possible with the stolen numbers until the card is blocked. These fraud chain attacks typically happen in a short time frame and can be easily spotted amongst historical transactions. This is because the velocity of transactions during the attack significantly differs from that of cardholder’s usual spending pattern. This notebook is optional to run. The generated data already exists in the ./DATA folder for you to use. Re-run this notebook if you desire to re-populate fresh data or understand the whole process of how this dataset was generated.</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "### Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "#### Prerequisites "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collecting Faker\n",
      "  Downloading Faker-5.0.1-py3-none-any.whl (1.1 MB)\n",
      "\u001b[K     |████████████████████████████████| 1.1 MB 28.2 MB/s eta 0:00:01\n",
      "\u001b[?25hCollecting text-unidecode==1.3\n",
      "  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)\n",
      "\u001b[K     |████████████████████████████████| 78 kB 15.3 MB/s  eta 0:00:01\n",
      "\u001b[?25hRequirement already satisfied: python-dateutil>=2.4 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from Faker) (2.8.1)\n",
      "Requirement already satisfied: six>=1.5 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from python-dateutil>=2.4->Faker) (1.14.0)\n",
      "Installing collected packages: text-unidecode, Faker\n",
      "Successfully installed Faker-5.0.1 text-unidecode-1.3\n",
      "\u001b[33mWARNING: You are using pip version 20.0.2; however, version 20.3.1 is available.\n",
      "You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "!pip install Faker"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "#### Imports "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "from botocore.client import ClientError\n",
    "from collections import defaultdict\n",
    "from faker import Faker\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import sagemaker\n",
    "import datetime\n",
    "import hashlib\n",
    "import random\n",
    "import boto3\n",
    "import math\n",
    "import os"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "#### Seed for Reproducibility"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "faker = Faker()\n",
    "faker.seed_locale('en_US', 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "SEED = 123\n",
    "random.seed(SEED)\n",
    "np.random.seed(SEED)\n",
    "faker.seed_instance(SEED)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "#### Constants "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "TOTAL_UNIQUE_TRANSACTIONS = 5400000 # 5.4 Million\n",
    "TOTAL_UNIQUE_USERS = 10000\n",
    "BUCKET = sagemaker.Session().default_bucket()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "### Generate Transactions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "#### Generate Unique Credit Card Numbers \n",
    "<p> Credit card numbers are uniquely assigned to users. Since, there are 10K users, we would want to generate 10K unique card numbers.</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "def generate_unique_credit_card_numbers(n: int) -> list:\n",
    "    cc_ids = set()\n",
    "    for _ in range(n):\n",
    "        cc_id = faker.credit_card_number(card_type='visa')\n",
    "        cc_ids.add(cc_id)\n",
    "    return list(cc_ids) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "credit_card_numbers = generate_unique_credit_card_numbers(TOTAL_UNIQUE_USERS)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "assert len(credit_card_numbers) == 10000 \n",
    "assert len(credit_card_numbers[0]) == 16 # validate if generated number is 16-digit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['4027234631445034',\n",
       " '4216635904111650',\n",
       " '4143031229225235',\n",
       " '4389330834474284',\n",
       " '4310264246964414']"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# inspect random sample of credit card numbers \n",
    "random.sample(credit_card_numbers, 5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "#### Generate Time Series\n",
    "<p>Generate 5.4 Million random timestamps spread across a period of 5 months (2020-01-01 to 2020-06-01) in sorted order.</p>\n",
    "<b>Note:</b> The timestamps are NOT unique themselves. We can have 2 or more transactions occurring at the same time coming from different users. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "def generate_timestamps(n: int) -> list:\n",
    "    start = datetime.datetime.strptime('2020-01-01 00:00:00', '%Y-%m-%d %H:%M:%S')\n",
    "    end = datetime.datetime.strptime('2020-06-01 00:01:00', '%Y-%m-%d %H:%M:%S')\n",
    "    timestamps = list()\n",
    "    for _ in range(n):\n",
    "        timestamp = faker.date_time_between(start_date=start, end_date=end, tzinfo=None).strftime('%Y-%m-%d %H:%M:%S')\n",
    "        timestamps.append(timestamp)\n",
    "    timestamps = sorted(timestamps)\n",
    "    return timestamps"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "timestamps = generate_timestamps(TOTAL_UNIQUE_TRANSACTIONS)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "assert len(timestamps) == TOTAL_UNIQUE_TRANSACTIONS"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['2020-01-26 09:57:57',\n",
       " '2020-01-09 23:58:11',\n",
       " '2020-03-30 11:17:51',\n",
       " '2020-05-06 14:10:53',\n",
       " '2020-05-12 18:05:20']"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# inspect random sample of timestamps\n",
    "random.sample(timestamps, 5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "#### Generate Random Transaction Amounts \n",
    "<p>The transaction amounts are presumed to follow Pareto distribution, as it is logical for consumers to make many more smaller purchases than large ones. The break down of the distribution is shown in the table below.</p>\n",
    "\n",
    "\n",
    "| Percentage        | Range (Amount in $)     |\n",
    "| :-------------: | :----------: |\n",
    "|  5\\% | 0.01 to 1    |\n",
    "| 7.5\\%   | 1 to 10 |\n",
    "| 52.5\\%   | 10 to 100 |\n",
    "| 25\\%   | 100 to 1000 |\n",
    "| 10\\%   | 1000 to 10000 |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "def get_random_transaction_amount(start: float, end: float) -> float:\n",
    "    amt = round(np.random.uniform(start, end), 2)\n",
    "    return amt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "distribution_percentages = {0.05: (0.01, 1.01), \n",
    "                            0.075: (1, 11.01),\n",
    "                            0.525: (10, 100.01),\n",
    "                            0.25: (100, 1000.01),\n",
    "                            0.10: (1000, 10000.01)}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "amounts = []\n",
    "\n",
    "for percentage, span in distribution_percentages.items():\n",
    "    n = int(TOTAL_UNIQUE_TRANSACTIONS * percentage)\n",
    "    start, end = span\n",
    "    for _ in range(n):\n",
    "        amounts.append(get_random_transaction_amount(start, end+1))\n",
    "        \n",
    "random.shuffle(amounts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "assert len(amounts) == TOTAL_UNIQUE_TRANSACTIONS"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[2700.62, 0.3, 79.26, 466.93, 63.75]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# inspect random sample of transaction amounts\n",
    "random.sample(amounts, 5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "#### Generate Credit Card Transactions\n",
    "<br>\n",
    "<div style=\"text-align: justify\">\n",
    "Using the random credit card numbers, timestamps and transaction amounts generated in the above steps, \n",
    "we can generate random credit card transactions by combining them. The transaction id for the transaction is the md5\n",
    "hash of the above mentioned entities.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "def generate_transaction_id(timestamp: str, credit_card_number: str, transaction_amount: float) -> str:\n",
    "    hashable = f'{timestamp}{credit_card_number}{transaction_amount}'\n",
    "    hexdigest = hashlib.md5(hashable.encode('utf-8')).hexdigest()\n",
    "    return hexdigest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "transactions = []\n",
    "for timestamp, amount in zip(timestamps, amounts):\n",
    "    credit_card_number = random.choice(credit_card_numbers)\n",
    "    transaction_id = generate_transaction_id(timestamp, credit_card_number, amount)\n",
    "    transactions.append({'tid': transaction_id, \n",
    "                         'datetime': timestamp, \n",
    "                         'cc_num': credit_card_number, \n",
    "                         'amount': amount, \n",
    "                         'fraud_label': 0})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "assert len(transactions) == TOTAL_UNIQUE_TRANSACTIONS"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'tid': '96fd421a79b8861a8fe82472b540d33e',\n",
       "  'datetime': '2020-02-26 19:43:46',\n",
       "  'cc_num': '4004143260155276',\n",
       "  'amount': 81.6,\n",
       "  'fraud_label': 0}]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# inspect random sample of credit card transactions\n",
    "random.sample(transactions, 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "### Inject Fradulent Transactions\n",
    "<p> A typical fraud chain looks like the one as shown in the image below.</p>\n",
    "<img src=\"./images/fraud_pattern.png\" />"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "FRAUD_RATIO = 0.0025 # percentage of transactions that are fraudulent\n",
    "NUMBER_OF_FRAUDULENT_TRANSACTIONS = int(FRAUD_RATIO * TOTAL_UNIQUE_TRANSACTIONS)\n",
    "ATTACK_CHAIN_LENGTHS = [3, 4, 5, 6, 7, 8, 9, 10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "#### Create Transaction Chains "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "visited = set()\n",
    "chains = defaultdict(list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "def size(chains: dict) -> int:\n",
    "    counts = {key: len(values)+1 for (key, values) in chains.items()}\n",
    "    return sum(counts.values())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "def create_attack_chain(i: int):\n",
    "    chain_length = random.choice(ATTACK_CHAIN_LENGTHS)\n",
    "    for j in range(1, chain_length):\n",
    "        if i+j not in visited:\n",
    "            if size(chains) == NUMBER_OF_FRAUDULENT_TRANSACTIONS:\n",
    "                break\n",
    "            chains[i].append(i+j)\n",
    "            visited.add(i+j)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "while size(chains) < NUMBER_OF_FRAUDULENT_TRANSACTIONS:\n",
    "    i = random.choice(range(TOTAL_UNIQUE_TRANSACTIONS))\n",
    "    if i not in visited:\n",
    "        create_attack_chain(i)\n",
    "        visited.add(i)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "assert size(chains) == NUMBER_OF_FRAUDULENT_TRANSACTIONS"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "#### Modify Transactions with Fraud Chain Attacks "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "def generate_timestamps_for_fraud_attacks(timestamp: str, chain_length: int) -> list:\n",
    "    timestamps = []\n",
    "    timestamp = datetime.datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S')\n",
    "    for _ in range(chain_length):\n",
    "        # interval in seconds between fraudulent attacks\n",
    "        delta = random.randint(30, 120)\n",
    "        current = timestamp + datetime.timedelta(seconds=delta)\n",
    "        timestamps.append(current.strftime('%Y-%m-%d %H:%M:%S'))\n",
    "        timestamp = current\n",
    "    return timestamps "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "def generate_amounts_for_fraud_attacks(chain_length: int) -> list:\n",
    "    amounts = []\n",
    "    for percentage, span in distribution_percentages.items():\n",
    "        n = math.ceil(chain_length * percentage)\n",
    "        start, end = span\n",
    "        for _ in range(n):\n",
    "            amounts.append(get_random_transaction_amount(start, end+1))\n",
    "    return amounts[:chain_length]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "for key, chain in chains.items():\n",
    "    transaction = transactions[key]\n",
    "    timestamp = transaction['datetime']\n",
    "    cc_num = transaction['cc_num']\n",
    "    amount = transaction['amount']\n",
    "    transaction['fraud_label'] = 1\n",
    "    inject_timestamps = generate_timestamps_for_fraud_attacks(timestamp, len(chain))\n",
    "    inject_amounts = generate_amounts_for_fraud_attacks(len(chain))\n",
    "    random.shuffle(inject_amounts)\n",
    "    for i, idx in enumerate(chain):\n",
    "            original_transaction = transactions[idx]\n",
    "            inject_timestamp = inject_timestamps[i]\n",
    "            original_transaction['datetime'] = inject_timestamp\n",
    "            original_transaction['fraud_label'] = 1\n",
    "            original_transaction['cc_num'] = cc_num\n",
    "            original_transaction['amount'] = inject_amounts[i]\n",
    "            original_transaction['tid'] = generate_transaction_id(inject_timestamp, cc_num, amount)\n",
    "            transactions[idx] = original_transaction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "#### Transform Transactions to Pandas DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "transactions_df = pd.DataFrame(transactions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>tid</th>\n",
       "      <th>datetime</th>\n",
       "      <th>cc_num</th>\n",
       "      <th>amount</th>\n",
       "      <th>fraud_label</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>9273</th>\n",
       "      <td>aef43ca7c5a122136d59d13ac21a2edd</td>\n",
       "      <td>2020-01-01 06:13:25</td>\n",
       "      <td>4518462459928874</td>\n",
       "      <td>6.33</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9274</th>\n",
       "      <td>78e9479ce83c905eba8f37c76064e5f2</td>\n",
       "      <td>2020-01-01 06:14:01</td>\n",
       "      <td>4518462459928874</td>\n",
       "      <td>4.58</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9275</th>\n",
       "      <td>f37a788e9ee4ade23eb928ba5b605351</td>\n",
       "      <td>2020-01-01 06:15:27</td>\n",
       "      <td>4518462459928874</td>\n",
       "      <td>53.21</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9276</th>\n",
       "      <td>1fb50a85cc8edf9ea6273da72d4cc4c4</td>\n",
       "      <td>2020-01-01 06:17:13</td>\n",
       "      <td>4518462459928874</td>\n",
       "      <td>0.94</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9277</th>\n",
       "      <td>05e5ee5d1ce1b7bc16bee8a62a9f0a6d</td>\n",
       "      <td>2020-01-01 06:18:00</td>\n",
       "      <td>4518462459928874</td>\n",
       "      <td>51.41</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                   tid             datetime            cc_num  \\\n",
       "9273  aef43ca7c5a122136d59d13ac21a2edd  2020-01-01 06:13:25  4518462459928874   \n",
       "9274  78e9479ce83c905eba8f37c76064e5f2  2020-01-01 06:14:01  4518462459928874   \n",
       "9275  f37a788e9ee4ade23eb928ba5b605351  2020-01-01 06:15:27  4518462459928874   \n",
       "9276  1fb50a85cc8edf9ea6273da72d4cc4c4  2020-01-01 06:17:13  4518462459928874   \n",
       "9277  05e5ee5d1ce1b7bc16bee8a62a9f0a6d  2020-01-01 06:18:00  4518462459928874   \n",
       "\n",
       "      amount  fraud_label  \n",
       "9273    6.33            1  \n",
       "9274    4.58            1  \n",
       "9275   53.21            1  \n",
       "9276    0.94            1  \n",
       "9277   51.41            1  "
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fraud_transactions = transactions_df[transactions_df.fraud_label.eq(1)]\n",
    "fraud_transactions.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "assert fraud_transactions.count()[0] == NUMBER_OF_FRAUDULENT_TRANSACTIONS"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "### Save Generated Data\n",
    "<p> The generated raw transactions data will be used by the next step = SageMaker PySpark Processing Job to do aggregations on the raw data columns and derive new features which are useful for model training in the later steps.\n",
    "The generated data is saved locally and then copied to S3 bucket.</p>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "#### Save Transactions Data to Local Folder ./DATA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "data_dir = os.path.join(os.getcwd(), 'data/raw')\n",
    "os.makedirs(data_dir, exist_ok=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [],
   "source": [
    "transactions_df.to_csv(f'{data_dir}/transactions.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "source": [
    "#### Copy Local Data to S3 Bucket"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "button": false,
    "new_sheet": false,
    "run_control": {
     "read_only": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "upload: data/raw/transactions.csv to s3://sagemaker-us-east-1-835319576252/raw/transactions.csv\n"
     ]
    }
   ],
   "source": [
    "!aws s3 cp {data_dir}/transactions.csv s3://{BUCKET}/raw/transactions.csv"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
